XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
The XNOR-Net binarization approach seeks to identify the most accurate convolutional
approximations. Specifically, XNOR-Net employs a scaling factor, which plays a vital role
in the learning of BNNs and improves the forward pass as:
$$\mathbf{a}^{n}_{\mathrm{out}} \approx \boldsymbol{\alpha}^{n} \circ \left(\mathbf{b}^{n}_{\mathbf{w}} \odot \mathbf{b}^{n}_{\mathbf{a}_{\mathrm{in}}}\right), \tag{3.3}$$
where $\boldsymbol{\alpha}^{n} = \{\alpha^{n}_{1}, \alpha^{n}_{2}, \dots, \alpha^{n}_{C^{n}_{\mathrm{out}}}\} \in \mathbb{R}^{C^{n}_{\mathrm{out}}}_{+}$ is known as the channel-wise scaling factor vector
to mitigate the output gap between Eq. (3.1) and its approximation in Eq. (3.3). We denote
$\mathcal{A} = \{\boldsymbol{\alpha}^{n}\}_{n=1}^{N}$. Since the weight values are binary, XNOR-Net can implement the convolution
with additions and subtractions. In the following, we state the XNOR operation for a
specific convolution layer, thus omitting the superscript $n$ for simplicity. Most existing
implementations simply follow earlier studies [199, 159] to optimize $\mathcal{A}$ based on non-parametric
optimization as:
$$\alpha^{*}, \mathbf{b}^{*}_{\mathbf{w}} = \arg\min_{\alpha,\,\mathbf{b}_{\mathbf{w}}} J(\alpha, \mathbf{b}_{\mathbf{w}}), \tag{3.4}$$
$$J(\alpha, \mathbf{b}_{\mathbf{w}}) = \left\|\mathbf{w} - \alpha \circ \mathbf{b}_{\mathbf{w}}\right\|_{2}^{2}. \tag{3.5}$$
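The statement above that binary weights let XNOR-Net replace multiplications with additions and subtractions can be checked on a toy inner product: when both operands are in $\{-1, +1\}$ and encoded as bits, the dot product equals $2 \cdot \mathrm{popcount}(\mathrm{XNOR}) - M$. A minimal NumPy sketch with illustrative sizes (not the original implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 64  # elements per filter, i.e. C_in * K * K (toy size)

# random ±1 weight and activation vectors
bw = rng.choice([-1.0, 1.0], size=M)
ba = rng.choice([-1.0, 1.0], size=M)

# bit-encode: +1 -> True, -1 -> False
w_bits = bw > 0
a_bits = ba > 0

# XNOR + popcount: matching bits contribute +1, mismatches -1
matches = int(np.sum(w_bits == a_bits))  # popcount of the XNOR result
dot_xnor = 2 * matches - M

assert dot_xnor == int(bw @ ba)  # equals the ±1 dot product exactly
```

On hardware, the boolean comparison becomes a bitwise XNOR over packed words followed by a population count, which is where the speedup of binary convolution comes from.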
By expanding Eq. 3.5, we have:
$$J(\alpha, \mathbf{b}_{\mathbf{w}}) = \alpha^{2}(\mathbf{b}_{\mathbf{w}})^{\mathrm{T}}\mathbf{b}_{\mathbf{w}} - 2\alpha \circ \mathbf{w}^{\mathrm{T}}\mathbf{b}_{\mathbf{w}} + \mathbf{w}^{\mathrm{T}}\mathbf{w}, \tag{3.6}$$
where $\mathbf{b}_{\mathbf{w}} \in \mathbb{B}$. Thus, $(\mathbf{b}_{\mathbf{w}})^{\mathrm{T}}\mathbf{b}_{\mathbf{w}} = C_{\mathrm{in}} \times K \times K$, and $\mathbf{w}^{\mathrm{T}}\mathbf{w}$ is also a constant since $\mathbf{w}$ is
a known variable. Thus, Eq. (3.6) can be rewritten as:
$$J(\alpha, \mathbf{b}_{\mathbf{w}}) = \alpha^{2} \times C_{\mathrm{in}} \times K \times K - 2\alpha \circ \mathbf{w}^{\mathrm{T}}\mathbf{b}_{\mathbf{w}} + \mathrm{constant}. \tag{3.7}$$
The optimal $\mathbf{b}_{\mathbf{w}}$ can then be obtained by solving the following constrained maximization:
$$\mathbf{b}^{*}_{\mathbf{w}} = \arg\max_{\mathbf{b}_{\mathbf{w}}} \mathbf{w}^{\mathrm{T}}\mathbf{b}_{\mathbf{w}}, \quad \mathrm{s.t.} \ \mathbf{b}_{\mathbf{w}} \in \mathbb{B}, \tag{3.8}$$
which can be solved by the sign function:
$$b_{w_{i}} = \begin{cases} +1, & w_{i} \geq 0, \\ -1, & w_{i} < 0, \end{cases}$$
which is the optimal solution and is also widely used as a general binarization solution in
numerous subsequent BNN works [159]. To find the optimal value of the scaling factor $\alpha^{*}$,
we take the derivative of $J(\cdot)$ w.r.t. $\alpha$ and set it to zero:
$$\alpha^{*} = \frac{\mathbf{w}^{\mathrm{T}}\mathbf{b}_{\mathbf{w}}}{C_{\mathrm{in}} \times K \times K}. \tag{3.9}$$
By replacing $\mathbf{b}_{\mathbf{w}}$ with the sign function, a closed-form solution of $\alpha$ can be derived via the
channel-wise absolute mean (CAM) as:
$$\alpha_{i} = \frac{\|\mathbf{w}_{i,:,:,:}\|_{1}}{C_{\mathrm{in}} \times K \times K}. \tag{3.10}$$
Therefore, the optimal estimate of a binary weight filter can be achieved simply by taking
the sign of the weight values, and the optimal scaling factor is the average of the absolute
weight values.
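As a numerical sanity check on this derivation, the following NumPy sketch (toy shapes, not the original implementation) computes $\mathbf{b}_{\mathbf{w}} = \mathrm{sign}(\mathbf{w})$ and the CAM scaling factors, then verifies that no randomly perturbed pair $(\alpha, \mathbf{b}_{\mathbf{w}})$ attains a lower objective $J$ of Eq. (3.5):

```python
import numpy as np

rng = np.random.default_rng(0)
C_out, C_in, K = 4, 3, 3  # toy filter-bank shape (illustrative)
w = rng.standard_normal((C_out, C_in, K, K))

# optimal binary weights: element-wise sign (solution of Eq. (3.8))
bw = np.where(w >= 0, 1.0, -1.0)

# optimal scaling: channel-wise absolute mean (Eq. (3.10))
alpha = np.abs(w).reshape(C_out, -1).mean(axis=1)

def J(alpha, bw):
    """Total squared error ||w - alpha * bw||_2^2, summed over channels (Eq. (3.5))."""
    diff = w - alpha[:, None, None, None] * bw
    return float(np.sum(diff ** 2))

j_star = J(alpha, bw)

# any perturbation of alpha or random sign flips in bw should not do better
for _ in range(100):
    a2 = alpha * rng.uniform(0.5, 1.5, size=C_out)
    b2 = bw * np.where(rng.random(bw.shape) < 0.1, -1.0, 1.0)
    assert j_star <= J(a2, b2) + 1e-12
```

Because the pair (sign, CAM) is the joint global minimizer of $J$ per output channel, every perturbed candidate is at best tied, which the assertions confirm empirically.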
Based on the explicitly solved $\alpha^{*}$, the training objective of XNOR-Net-like BNNs
is given in a bilevel form:
$$\mathbf{W}^{*} = \arg\min_{\mathbf{W}} \mathcal{L}(\mathbf{W}; \mathcal{A}^{*}), \quad \mathrm{s.t.} \ \alpha^{n*}, \mathbf{b}^{n*}_{\mathbf{w}} = \arg\min_{\alpha^{n},\,\mathbf{b}^{n}_{\mathbf{w}}} J(\alpha^{n}, \mathbf{b}^{n}_{\mathbf{w}}), \tag{3.11}$$
which is also known as hard binarization [159]. In the following, we show some variants of
such a binarization function.